Abstract



This paper introduces a new method for semi-supervised learning on high dimensional nonlinear
manifolds, which includes a phase of unsupervised basis learning and a phase of supervised
function learning. The learned bases provide a set of anchor points to form a local coordinate
system, such that each data point $x$ on the manifold can be locally approximated by a linear
combination of its nearby anchor points, with the linear weights offering a local-coordinate
coding of $x$. We show that a high dimensional nonlinear function can be approximated by a
global linear function with respect to this coding scheme, and the approximation quality is
ensured by the locality of such coding. The method turns a difficult nonlinear learning problem
into a simple global linear learning problem, which overcomes some drawbacks of traditional
local learning methods. The work also gives a theoretical justification to the empirical success
of some biologically-inspired models using sparse coding of sensory data, since a local coding
scheme must be sufficiently sparse. However, sparsity does not always satisfy locality
conditions, and can thus possibly lead to suboptimal results. The properties and performance of
the method are empirically verified on synthetic data, handwritten digit classification, and
object recognition tasks.

1 Introduction 

Consider the problem of learning a nonlinear function $f(x)$ in high dimension: $x \in \mathbb{R}^d$
with large $d$. We are given a set of labeled data $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from an
unknown underlying distribution. Moreover, assume that we observe a set of unlabeled data
$x \in \mathbb{R}^d$ from the same distribution. If the dimensionality $d$ is large compared to $n$,
then the traditional statistical theory predicts over-fitting due to the so-called "curse of
dimensionality". One intuitive argument for this effect is that when the dimensionality becomes
larger, pairwise distances between two similar data points become larger as well. Therefore one
needs more data points to adequately fill in the empty space. However, for many real problems
with high dimensional data, we do not observe this so-called curse of dimensionality. This is
because although data are physically represented in a high-dimensional space, they often
(approximately) lie on a manifold which has a much smaller intrinsic dimensionality.

This paper proposes a new method that can take advantage of the manifold geometric structure 
to learn a nonlinear function in high dimension. The main idea is to locally embed points on the 
manifold into a lower dimensional space, expressed as coordinates with respect to a set of anchor 
points. Our main observation is simple but very important: we show that a nonlinear function 
on the manifold can be effectively approximated by a linear function with such a coding under
appropriate localization conditions. Therefore by using Local Coordinate Coding, we turn a very
difficult high dimensional nonlinear learning problem into a much simpler linear learning problem, 
which has been extensively studied in the literature. This idea may also be considered as a high 
dimensional generalization of low dimensional local smoothing methods in the traditional statistical 
literature. 

2 Local Coordinate Coding 

We are interested in learning a smooth function $f(x)$ defined on a high dimensional space
$\mathbb{R}^d$. Let $\|\cdot\|$ be a norm on $\mathbb{R}^d$. Although we do not restrict to any
specific norm, in practice, one often employs the Euclidean norm (2-norm):
$\|x\| = \|x\|_2 = \sqrt{x_1^2 + \cdots + x_d^2}$.

Definition 2.1 (Lipschitz Smoothness) A function $f(x)$ on $\mathbb{R}^d$ is
$(\alpha, \beta, p)$-Lipschitz smooth with respect to a norm $\|\cdot\|$ if

$$|f(x') - f(x)| \le \alpha \|x - x'\|$$

and

$$|f(x') - f(x) - \nabla f(x)^\top (x' - x)| \le \beta \|x - x'\|^{1+p},$$

where we assume $\alpha, \beta > 0$ and $p \in (0, 1]$.

Note that if the Hessian of $f(x)$ exists, then we may take $p = 1$. Learning an arbitrary Lipschitz
smooth function on $\mathbb{R}^d$ can be difficult due to the curse of dimensionality. That is, the
number of samples required to characterize such a function $f(x)$ can be exponential in $d$.
However, in many practical applications, one often observes that the data we are interested in lie
approximately on a manifold $\mathcal{M}$ which is embedded into $\mathbb{R}^d$. Although $d$ is
large, the intrinsic dimensionality of $\mathcal{M}$ can be much smaller. Therefore if we are only
interested in learning $f(x)$ on $\mathcal{M}$, then the complexity should depend on the intrinsic
dimensionality of $\mathcal{M}$ instead of $d$.

In this paper, we approach this problem by introducing the idea of localized coordinate coding.
The formal definition of (non-localized) coordinate coding is given below, where we represent a
point in $\mathbb{R}^d$ by a linear combination of a set of "anchor points". Later we show it is
sufficient to choose a set of "anchor points" with cardinality depending on the intrinsic
dimensionality of the manifold rather than $d$.

Definition 2.2 (Coordinate Coding) A coordinate coding is a pair $(\gamma, C)$, where
$C \subset \mathbb{R}^d$ is a set of anchor points, and $\gamma$ is a map of $x \in \mathbb{R}^d$ to
$[\gamma_v(x)]_{v \in C} \in \mathbb{R}^{|C|}$ such that $\sum_v \gamma_v(x) = 1$. It induces the
following physical approximation of $x$ in $\mathbb{R}^d$:

$$\gamma(x) = \sum_{v \in C} \gamma_v(x)\, v.$$

Moreover, for all $x \in \mathbb{R}^d$, we define the coding norm as

$$\|x\|_\gamma = \Big( \sum_{v \in C} \gamma_v(x)^2 \Big)^{1/2}.$$

The quantity $\|x\|_\gamma$ will become useful in our learning theory analysis. The condition
$\sum_v \gamma_v(x) = 1$ follows from the shift-invariance requirement, which means that the coding
should remain the same if we use a different origin of the coordinate system for representing data
points. However, if in practice we can find a good origin for the global coordinate system in
$\mathbb{R}^d$, and if all points on $\mathcal{M}$ are close to it, then the shift-invariance
requirement may become less important.

Proposition 2.1 The map $x \to \sum_{v \in C} \gamma_v(x)\, v$ is invariant under any shift of the
origin for representing data points in $\mathbb{R}^d$ if and only if $\sum_v \gamma_v(x) = 1$.
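The proposition can be checked numerically. The following is a minimal sketch, not part of the original analysis: the anchor points, the shift, and the helper names `physical_approximation` and `shift_gap` are all illustrative assumptions. It compares "approximate, then shift" with "shift the anchors, then approximate" for weights that do and do not sum to one.

```python
import numpy as np

def physical_approximation(weights, anchors):
    """Return gamma(x) = sum_v gamma_v(x) * v for one point."""
    return weights @ anchors

# Illustrative anchors in R^2 and an arbitrary shift of the origin.
anchors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shift = np.array([5.0, -3.0])

normalized = np.array([0.2, 0.5, 0.3])    # weights sum to 1
unnormalized = np.array([0.2, 0.5, 0.6])  # weights sum to 1.3

def shift_gap(weights):
    """Distance between gamma(x) + u and the approximation built from shifted anchors."""
    before = physical_approximation(weights, anchors) + shift
    after = physical_approximation(weights, anchors + shift)
    return np.linalg.norm(after - before)

gap_normalized = shift_gap(normalized)      # (sum - 1) * ||u|| = 0
gap_unnormalized = shift_gap(unnormalized)  # 0.3 * ||u||
```

The gap equals $|\sum_v \gamma_v(x) - 1| \cdot \|u\|$, so it vanishes exactly when the weights sum to one.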

The importance of the coordinate coding concept is that if a coordinate coding is sufficiently
localized, then a nonlinear function can be approximated by a linear function with respect to the
coding. This critical observation, illustrated in the following linearization lemma, is the
foundation of our approach.



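Lemma 2.1 (Linearization) Let $(\gamma, C)$ be an arbitrary coordinate coding on $\mathbb{R}^d$.
Let $f$ be an $(\alpha, \beta, p)$-Lipschitz smooth function. We have for all $x \in \mathbb{R}^d$:

$$\Big| f(x) - \sum_{v \in C} \gamma_v(x) f(v) \Big| \le \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p}.$$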



To understand this result, we note that on the left hand side, a nonlinear function $f(x)$ in
$\mathbb{R}^d$ is approximated by a linear function $\sum_{v \in C} \gamma_v(x) f(v)$ with respect to
the coding $\gamma(x)$, where $[f(v)]_{v \in C}$ is the set of coefficients to be estimated from
data. The quality of this approximation is bounded by the right hand side, which has two terms: the
first term $\|x - \gamma(x)\|$ means $x$ should be close to its physical approximation $\gamma(x)$,
and the second term means that the coding should be localized. The quality of a coding $\gamma$ with
respect to $C$ can be measured by the right hand side. For convenience, we introduce the following
definition, which measures the locality of a coding.
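As a minimal numerical sketch of the linearization bound (an illustration under stated assumptions, not an experiment from the paper), take $f = \sin$ on the real line, for which $\alpha = 1$, $\beta = 1/2$, $p = 1$ are valid constants. The coding is assumed to be the barycentric weights on the two anchors bracketing $x$; the helper name `two_nearest_coding` and the anchor grid are hypothetical choices.

```python
import numpy as np

def two_nearest_coding(x, anchors):
    """Barycentric weights on the two sorted 1-D anchors that bracket x."""
    j = np.clip(np.searchsorted(anchors, x), 1, len(anchors) - 1)
    left, right = anchors[j - 1], anchors[j]
    gamma = np.zeros(len(anchors))
    gamma[j] = (x - left) / (right - left)
    gamma[j - 1] = 1.0 - gamma[j]
    return gamma

# For f = sin: |f'| <= 1 gives alpha = 1, |f''| <= 1 gives beta = 1/2 with p = 1.
alpha, beta, p = 1.0, 0.5, 1.0
anchors = np.linspace(0.0, 2.0 * np.pi, 9)
xs = np.linspace(0.0, 2.0 * np.pi, 200)

worst_slack = np.inf
for x in xs:
    gamma = two_nearest_coding(x, anchors)
    x_hat = gamma @ anchors                                  # physical approximation gamma(x)
    lhs = abs(np.sin(x) - gamma @ np.sin(anchors))           # linearization error
    rhs = alpha * abs(x - x_hat) + beta * np.sum(
        np.abs(gamma) * np.abs(anchors - x_hat) ** (1 + p))  # bound of the lemma
    worst_slack = min(worst_slack, rhs - lhs)
```

Here the weights sum to one and reproduce $x$ exactly, so the first term of the bound vanishes and only the locality term remains. Spreading weight onto distant anchors would inflate that term.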

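Definition 2.3 (Localization Measure) Given $\alpha, \beta, p$, and coding $(\gamma, C)$, we define

$$Q_{\alpha,\beta,p}(\gamma, C) = \mathbb{E}_x \Big[ \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p} \Big].$$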



Observe that in $Q_{\alpha,\beta,p}$, $\alpha, \beta, p$ may be regarded as tuning parameters; we may
also simply pick $\alpha = \beta = p = 1$. Since the quality function $Q_{\alpha,\beta,p}(\gamma, C)$
only depends on unlabeled data, in principle, we can find $[\gamma, C]$ by optimizing this quality
using unlabeled data. Later, we will consider simplifications of this objective function that are
easier to compute.

Next we show that if the data lie on a manifold, then the complexity of local coordinate coding
depends on the intrinsic manifold dimensionality instead of $d$. We first define a manifold and its
intrinsic dimensionality.

Definition 2.4 (Manifold) A subset $\mathcal{M} \subset \mathbb{R}^d$ is called a $p$-smooth
($p > 0$) manifold with intrinsic dimensionality $m = m(\mathcal{M})$ if there exists a constant
$c_p(\mathcal{M})$ such that given any $x \in \mathcal{M}$, there exist $m$ vectors
$v_1(x), \ldots, v_m(x) \in \mathbb{R}^d$ so that $\forall x' \in \mathcal{M}$:

$$\inf_{\gamma \in \mathbb{R}^m} \Big\| x' - x - \sum_{j=1}^m \gamma_j v_j(x) \Big\| \le c_p(\mathcal{M}) \|x' - x\|^{1+p}.$$

This definition is quite intuitive. The smooth manifold structure implies that one can approximate
a point in $\mathcal{M}$ effectively using local coordinate coding. Note that for a typical manifold
with well-defined curvature, we can take $p = 1$.

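Definition 2.5 (Covering Number) Given any subset $\mathcal{M} \subset \mathbb{R}^d$, and
$\epsilon > 0$. The covering number, denoted as $\mathcal{N}(\epsilon, \mathcal{M})$, is the smallest
cardinality of an $\epsilon$-cover $C \subset \mathcal{M}$. That is,

$$\sup_{x \in \mathcal{M}} \inf_{v \in C} \|x - v\| \le \epsilon.$$

For a compact manifold with intrinsic dimensionality $m$, there exists a constant $c(\mathcal{M})$
such that its covering number is bounded by

$$\mathcal{N}(\epsilon, \mathcal{M}) \le c(\mathcal{M})\, \epsilon^{-m}.$$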

The usual statistical definition of dimensionality only involves the covering number |C|. However, 
the manifold intrinsic dimensionality is also important by itself in our analysis. 
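To make the covering number concrete, the sketch below builds an $\epsilon$-cover greedily on a finite sample of a circle embedded in $\mathbb{R}^3$ (ambient $d = 3$, intrinsic $m = 1$). This is an illustrative assumption, not a construction from the paper: the greedy procedure only gives an upper bound on the covering number of the sample, and the function name `greedy_epsilon_cover` is hypothetical.

```python
import numpy as np

def greedy_epsilon_cover(points, eps):
    """Pick centers among `points` until every point is within eps of some center."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        idx = np.flatnonzero(uncovered)[0]       # first point not yet covered
        centers.append(points[idx])
        dist = np.linalg.norm(points - points[idx], axis=1)
        uncovered &= dist > eps                  # everything within eps is now covered
    return np.array(centers)

# A circle embedded in R^3: ambient dimension 3, intrinsic dimension 1.
theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)

# Cover sizes for a few radii; halving eps should roughly double the size when m = 1.
sizes = {eps: len(greedy_epsilon_cover(circle, eps)) for eps in (0.4, 0.2, 0.1)}
```

For an intrinsically one-dimensional set, the cover size grows roughly like $\epsilon^{-1}$ regardless of the ambient dimension, which is the behavior the bound $c(\mathcal{M})\epsilon^{-m}$ describes.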

The following result shows that there exists a local coordinate coding to a set of anchor points $C$
of cardinality $O(m(\mathcal{M})\, \mathcal{N}(\epsilon, \mathcal{M}))$ such that any
$(\alpha, \beta, p)$-Lipschitz smooth function can be linearly approximated using local coordinate
coding up to the accuracy $O(\sqrt{m(\mathcal{M})}\, \epsilon^{1+p})$.

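Theorem 2.1 (Manifold Coding) Suppose the data points $x$ lie on a compact $p$-smooth manifold
$\mathcal{M}$, and the norm is defined as $\|x\| = (x^\top A x)^{1/2}$ for some positive definite
matrix $A$. Then given any $\epsilon > 0$, there exist anchor points $C \subset \mathcal{M}$ and
coding $\gamma$ such that

$$|C| \le (1 + m)\, \mathcal{N}(\epsilon, \mathcal{M}),$$

$$Q_{\alpha,\beta,p}(\gamma, C) \le \big[ \alpha c_p(\mathcal{M}) + (1 + \sqrt{m} + 2^{1+p} \sqrt{m}) \beta \big]\, \epsilon^{1+p},$$

where $m = m(\mathcal{M})$. Moreover, for all $x \in \mathcal{M}$, we have
$\|x\|_\gamma^2 \le 1 + (1 + \sqrt{m})^2$.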



The approximation result in Theorem 2.1 means that the complexity of linearization in Lemma 2.1
depends only on the intrinsic dimension $m(\mathcal{M})$ of $\mathcal{M}$ instead of $d$. Although
this result is proved for manifolds, it is important to observe that the coordinate coding method
proposed in this paper does not require the data to lie precisely on a manifold, and it does not
require knowing $m(\mathcal{M})$. In fact, similar results hold even when the data only
approximately lie on a manifold.

In the next section, we characterize the learning complexity of the local coordinate coding
method. It implies that linear prediction methods can be used to effectively learn nonlinear
functions on a manifold. The nonlinearity is fully captured by the coordinate coding map $\gamma$
(which can be a nonlinear function). This approach has clear advantages because the problem of
learning local-coordinate coding is much simpler than direct nonlinear learning:

• Learning $(\gamma, C)$ only requires unlabeled data, and the number of unlabeled data can be
significantly more than the number of labeled data. This also prevents overfitting with respect
to labeled data.

• In practice, we do not have to find the optimal coding because the coordinates are merely
features for linear supervised learning. This significantly simplifies the optimization problem.
Consequently, it is more robust than standard approaches to nonlinear learning that directly
optimize nonlinear functions on labeled data (e.g., neural networks).

3 Learning Theory

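In machine learning, we minimize the expected loss $\phi(f(x), y)$ with respect to the underlying
distribution

$$\mathbb{E}_{x,y}\, \phi(f(x), y)$$

within a function class $f(x) \in \mathcal{F}$. In this paper, we are interested in the function class

$$\mathcal{F}_{\alpha,\beta,p} = \{ f(x) : (\alpha, \beta, p)\text{-Lipschitz smooth function in } \mathbb{R}^d \}.$$

The local coordinate coding method considers a linear approximation of functions in
$\mathcal{F}_{\alpha,\beta,p}$ on the data manifold. Given a local-coordinate coding scheme
$(\gamma, C)$, we approximate each $f(x) \in \mathcal{F}_{\alpha,\beta,p}$ by

$$f(x) \approx f_{\gamma,C}(\hat{w}, x) = \sum_{v \in C} \hat{w}_v \gamma_v(x),$$

where we estimate the coefficients using ridge regression as:

$$[\hat{w}_v] = \arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i=1}^n \phi(f_{\gamma,C}(w, x_i), y_i) + \lambda \sum_{v \in C} w_v^2 \Big]. \qquad (1)$$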



Given a loss function $\phi(p, y)$, let $\phi'_1(p, y) = \partial \phi(p, y) / \partial p$. For
simplicity, in this paper we only consider convex Lipschitz loss functions, where
$|\phi'_1(p, y)| \le B$. This includes the standard classification loss functions such as logistic
regression and SVM (hinge loss), both with $B = 1$.
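The supervised phase amounts to fitting a single global linear function on top of the codes. The sketch below is a simplified illustration under explicit assumptions: it uses the squared loss (which has a closed-form ridge solution, unlike the Lipschitz losses analyzed here), a one-dimensional input, and piecewise-linear "hat" weights as a stand-in local coding. The names `hat_codes`, `fit_ridge_on_codes`, and `predict` are hypothetical.

```python
import numpy as np

def hat_codes(x, anchors):
    """Local 1-D coding: piecewise-linear (hat) weights on equally spaced anchors."""
    h = anchors[1] - anchors[0]
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - anchors[None, :]) / h)

def fit_ridge_on_codes(codes, y, lam):
    """Solve min_w sum_i (codes[i] @ w - y[i])**2 + lam * ||w||^2 in closed form."""
    k = codes.shape[1]
    return np.linalg.solve(codes.T @ codes + lam * np.eye(k), codes.T @ y)

def predict(codes, w):
    """Global linear prediction with respect to the coding."""
    return codes @ w

rng = np.random.default_rng(0)
anchors = np.linspace(0.0, 2.0 * np.pi, 17)
x_train = rng.uniform(0.0, 2.0 * np.pi, 200)
w = fit_ridge_on_codes(hat_codes(x_train, anchors), np.sin(x_train), lam=1e-3)

x_test = np.linspace(0.0, 2.0 * np.pi, 500)
test_error = predict(hat_codes(x_test, anchors), w) - np.sin(x_test)
test_rmse = float(np.sqrt(np.mean(test_error ** 2)))
```

The learned weights play the role of the values $f(v)$ at the anchors: the predictor is linear in the codes yet nonlinear in $x$.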

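Theorem 3.1 (Generalization Bound) Suppose $\phi(p, y)$ is Lipschitz: $|\phi'_1(p, y)| \le B$.
Consider coordinate coding $(\gamma, C)$, and the estimation method (1) with random training
examples $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$. Then the expected generalization error satisfies
the inequality:

$$\mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi(f_{\gamma,C}(\hat{w}, x), y) \le \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \Big[ \mathbb{E}_{x,y}\, \phi(f(x), y) + \lambda \sum_{v \in C} f(v)^2 \Big] + \frac{B^2}{2 \lambda n}\, \mathbb{E}_x \|x\|_\gamma^2 + B\, Q_{\alpha,\beta,p}(\gamma, C).$$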



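If we choose the regularization parameter $\lambda$ that optimizes the bound, then the right hand
side becomes

$$\inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \Big[ \mathbb{E}_{x,y}\, \phi(f(x), y) + B \sqrt{\frac{2}{n} \sum_{v \in C} f(v)^2\; \mathbb{E}_x \|x\|_\gamma^2} \Big] + B\, Q_{\alpha,\beta,p}(\gamma, C). \qquad (2)$$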



In particular, if we find $(\gamma, C)$ at some $\epsilon > 0$, then Theorem 2.1 implies the
following simplification for any $f \in \mathcal{F}_{\alpha,\beta,p}$ such that $|f(x)| \le A$ for a
fixed constant $A$: the bound on the generalization error becomes

$$\mathbb{E}_{x,y}\, \phi(f(x), y) + O\Big[ \sqrt{\epsilon^{-m(\mathcal{M})} / n} + \epsilon^{1+p} \Big].$$

By optimizing over $\epsilon$, we obtain a bound:
$\mathbb{E}_{x,y}\, \phi(f(x), y) + O\big(n^{-(1+p)/(2+2p+m(\mathcal{M}))}\big)$.
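The optimization over $\epsilon$ can be spelled out by balancing the two terms inside the $O[\cdot]$. The following short derivation is a sketch with constants ignored, writing $m = m(\mathcal{M})$:

```latex
\begin{align*}
\sqrt{\epsilon^{-m}/n} = \epsilon^{1+p}
  &\;\Longleftrightarrow\; \epsilon^{\,1+p+m/2} = n^{-1/2} \\
  &\;\Longleftrightarrow\; \epsilon = n^{-1/(2+2p+m)}, \\
\text{so that}\quad \epsilon^{1+p} &= n^{-(1+p)/(2+2p+m)} .
\end{align*}
```

The exponent depends on the intrinsic dimensionality $m$ only, which is why the rate does not degrade with the ambient dimension $d$.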



By combining Theorem 2.1 and Theorem 3.1, we can immediately obtain the following simple
consistency result. It shows that the algorithm can learn an arbitrary nonlinear function on a
manifold when $n \to \infty$. Note that Theorem 2.1 implies that the convergence only depends on the
intrinsic dimensionality of the manifold $\mathcal{M}$, not $d$.

Theorem 3.2 (Consistency) Suppose the data lie on a compact manifold
$\mathcal{M} \subset \mathbb{R}^d$, the norm $\|\cdot\|$ is the Euclidean norm in $\mathbb{R}^d$, and
the loss function $\phi(p, y)$ is Lipschitz. As $n \to \infty$, we choose
$\alpha, \beta \to \infty$, $\alpha/n, \beta/n \to 0$ ($\alpha, \beta$ depend on $n$), and $p = 0$.
Then it is possible to find coding $(\gamma, C)$ using unlabeled data such that $|C|/n \to 0$ and
$Q_{\alpha,\beta,p}(\gamma, C) \to 0$. If we pick $\lambda n \to \infty$ and $\lambda |C| \to 0$,
then the local coordinate coding method (1) is consistent as $n \to \infty$:

$$\lim_{n \to \infty} \mathbb{E}_{S_n} \mathbb{E}_{x,y}\, \phi(f(\hat{w}, x), y) = \inf_{f} \mathbb{E}_{x,y}\, \phi(f(x), y).$$

4 Practical Learning of Coding 

Given a coordinate coding $(\gamma, C)$, we can use (1) to learn a nonlinear function in
$\mathbb{R}^d$. We showed that $(\gamma, C)$ can be obtained by optimizing
$Q_{\alpha,\beta,p}(\gamma, C)$. In practice, we may also consider the following simplification of
the localization term:

$$\sum_{v \in C} |\gamma_v(x)|\, \|v - \gamma(x)\|^{1+p} \approx \sum_{v \in C} |\gamma_v(x)|\, \|v - x\|^{1+p}.$$

Note that we may simply choose $p = 0$ or $p = 1$. The formulation is related to sparse coding [i],
which has no locality constraints and corresponds to $p = -1$. In this representation, we may either
enforce the constraint $\sum_v \gamma_v(x) = 1$ or remove it for simplicity (in such case, we assume
that the coordinate origin is appropriately chosen so that the shift-invariance requirement is not
important). Putting the above together, we try to optimize the following objective function in
practice:

$$Q(\gamma, C) = \mathbb{E}_x \inf_{[\gamma_v]} \Big[ \Big\| x - \sum_{v \in C} \gamma_v v \Big\|^2 + \beta \sum_{v \in C} |\gamma_v|\, \|v - x\|^{1+p} \Big].$$
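For fixed anchor points, the inner minimization over the weights is a weighted $\ell_1$-regularized least-squares problem. The sketch below solves it with proximal gradient descent (ISTA), which is a solver chosen here for illustration and not necessarily the one used in the experiments. It assumes a factor $1/2$ on the reconstruction term (which only rescales the locality weight, called `beta` here), does not enforce the sum-to-one constraint, and uses the hypothetical name `lcc_encode`.

```python
import numpy as np

def lcc_encode(x, anchors, beta, p=1.0, n_iter=5000):
    """Approximately minimize 0.5*||x - sum_v g_v v||^2 + beta*sum_v |g_v|*||v - x||^(1+p)."""
    V = anchors.T                                    # columns are anchor points
    penalty = beta * np.linalg.norm(anchors - x, axis=1) ** (1.0 + p)
    step = 1.0 / np.linalg.norm(V, 2) ** 2           # 1 / Lipschitz constant of the gradient
    gamma = np.zeros(len(anchors))
    for _ in range(n_iter):
        z = gamma - step * (V.T @ (V @ gamma - x))   # gradient step on the quadratic term
        # Weighted soft threshold: distant anchors get a larger threshold.
        gamma = np.sign(z) * np.maximum(np.abs(z) - step * penalty, 0.0)
    return gamma

# Two anchors near x and one far away (illustrative values).
anchors = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])
x = np.array([0.5, 0.5])
gamma = lcc_encode(x, anchors, beta=0.1)
```

In this toy configuration the distant anchor receives exactly zero weight, because its penalty grows with its distance to $x$, while the two nearby anchors share the reconstruction. Replacing the distance-weighted penalty by a constant recovers the plain sparse coding penalty.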

5 Relationship to Other Methods 

Our work is related to several existing approaches in the literature of machine learning and statistics. 
The first class of them is nonlinear manifold learning, such as LLE [5], Isomap [B], and Laplacian 
Eigenmaps [T]. These methods find global coordinates of data manifold based on a pre-computed 
affinity graph of data points. The use of affinity graphs requires expensive computation and lacks a 
coherent way of generalization to new data. Our method learns a compact set of bases to form local 
coordinates, which has a linear complexity with respect to data size and can naturally handle unseen 
data. More importantly, local coordinate coding has a direct connection to function approximation 
on manifold, and thus provides a sound unsupervised pre-training method to facilitate further 
supervised learning tasks. 

Another set of related models are local models in statistics, such as local kernel smoothing and
local regression, both traditionally using fixed-bandwidth kernels. Local kernel smoothing can be
regarded as a zero-order method, while local regression is higher-order, including local linear
regression as the 1st-order case. Traditional local methods are not widely used in machine learning
practice, because data with non-uniform distribution on the manifold require adaptive-bandwidth
kernels. The problem can be somewhat alleviated by using K-nearest neighbors. However, adaptive
kernel smoothing still suffers from the high dimensionality and noise of data. On the other hand,
higher-order methods are computationally expensive and prone to overfitting, because they are
highly flexible in locally fitting many segments of data in high-dimensional spaces. Our method can
be seen as a generalized 1st-order local method with basis learning and adaptive locality. Compared
to local linear regression, the learning is achieved by fitting a single globally linear function
with respect to a set of learned local coordinates, which is much less prone to overfitting and
computationally much cheaper. This means that our method achieves a better balance between local
and global aspects of learning. The importance of such balance has been recently discussed in [7].

Finally, local-coordinate coding draws connections to vector quantization (VQ) coding and sparse 
coding, which have been widely applied in processing of sensory data, such as acoustic and image 
signals. Learning linear functions of VQ codes can be regarded as a generalized zero-order local 
method with adaptive basis learning. Our method has an intimate relationship with sparse coding. 
In fact, we can regard local coordinate coding as locally constrained sparse coding. Inspired by
biological visual systems, it has been argued that sparse features of signals are useful for
learning [1]. However, to the best of our knowledge, there is no analysis in the literature that
directly
answers the question why sparse codes can help learning nonlinear functions in high dimensional 
spaces. Our work reveals an important finding — a good first-order approximation to nonlinear 
function requires the codes to be local, which consequently requires the codes to be sparse. However, 
sparsity does not always guarantee locality conditions. Our experiments demonstrate that sparse 
coding is helpful for learning only when the codes are local. Therefore locality is more essential for 
coding, and sparsity is a consequence of such a condition. 

Properties of related methods discussed in this section are compared in Table 1.

method                          dimension   basis learning   approximation power
kernel smoothing                low         no               0th order
local linear regression         low         no               1st order
K-nearest neighbor              high        no               0th order
Vector Quantization (VQ)        high        yes              0th order
Local Coordinate Coding (LCC)   high        yes              1st order

Table 1: Comparison of Related Methods



6 Experiments 

We use three experiments to demonstrate various points of the theoretical claims, in particular
the importance of coding locality and the robustness of various methods to high data dimensionality.

6.1 Synthetic Data 

Our first experiment is based on a synthetic data set, where a nonlinear function is defined on
a Swiss-roll manifold, as shown in Figure 1-(1). The primary goal is to demonstrate the
performance of nonlinear function learning using simple linear ridge regression based on
representations obtained from traditional sparse coding and the newly suggested local coordinate
coding, which are, respectively, formulated as follows:

$$\min_{\gamma, C} \sum_x \Big[ \frac{1}{2} \|x - \gamma(x)\|^2 + \beta \sum_{v \in C} |\gamma_v(x)| \Big] + \lambda \sum_{v \in C} \|v\|^2 \qquad (3)$$

$$\min_{\gamma, C} \sum_x \Big[ \frac{1}{2} \|x - \gamma(x)\|^2 + \beta \sum_{v \in C} |\gamma_v(x)|\, \|v - x\|^2 \Big] + \lambda \sum_{v \in C} \|v\|^2 \qquad (4)$$

where $\gamma(x) = \sum_{v \in C} \gamma_v(x)\, v$. We note that (4) is an approximation to the
original formulation, mainly for the simplicity of computation.

We randomly sample 50,000 data points on the manifold for unsupervised basis learning, and
500 labeled points for supervised regression. The number of bases is fixed to be 128. The learned
nonlinear functions are tested on another set of 10,000 data points, with their performance
evaluated by root mean square error (RMSE).
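The data protocol above can be sketched as follows. The parametrization of the Swiss roll below is an assumption (the text does not give the exact parametrization or the target function), and `sample_swiss_roll` and `rmse` are hypothetical helper names. The anchors are drawn at random from the unlabeled sample, as in the fixed-bases setting.

```python
import numpy as np

def sample_swiss_roll(n, rng):
    """Sample n points on a Swiss roll in R^3 and return their 2-D intrinsic coordinates too."""
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))   # angle along the spiral (assumed range)
    h = 20.0 * rng.random(n)                         # position along the roll axis (assumed range)
    points = np.stack([t * np.cos(t), h, t * np.sin(t)], axis=1)
    return points, np.stack([t, h], axis=1)

def rmse(pred, target):
    """Root mean square error between predictions and targets."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

rng = np.random.default_rng(0)
unlabeled, _ = sample_swiss_roll(50_000, rng)        # for unsupervised basis learning
labeled, intrinsic = sample_swiss_roll(500, rng)     # for supervised regression
test_points, _ = sample_swiss_roll(10_000, rng)      # held-out evaluation set
anchors = unlabeled[rng.choice(len(unlabeled), size=128, replace=False)]
```

A target function can then be defined on the intrinsic coordinates and regressed on the codes of the labeled points.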

In the first setting, we let both coding methods use the same set of fixed bases, which are 128
points randomly sampled from the manifold. The regression results are shown in Figure 1-(2) and
(3), respectively. The sparse coding based approach fails to capture the nonlinear function, while
local coordinate coding behaves much better. We take a closer look at the data representations
obtained from the two encoding methods by visualizing the distributions of distances from encoded
data to bases that have positive, negative, or zero coefficients in Figure 2. It shows that sparse
coding lets bases far away from the encoded data have nonzero coefficients, while local coordinate
coding allows only nearby bases to get nonzero coefficients. In other words, sparse coding on this
data does not ensure good locality and thus fails to facilitate the nonlinear function learning.
As another interesting phenomenon, local coordinate coding seems to encourage coefficients to be
nonnegative, which is intuitively understandable: if we use several bases close to a data point to
linearly approximate the point, each basis should have a positive contribution. However, whether
there is any merit in explicitly enforcing non-negativity remains interesting future work.

Next, given the random bases as a common initialization, we let the two algorithms learn
bases from the 50,000 unlabeled data points. The regression results based on the learned bases are
depicted in Figure 1-(4) and (5), which indicate that regression error is further reduced for local
coordinate coding, but remains high for sparse coding. We also make a comparison with
local kernel smoothing, which takes a weighted average of function values of the K-nearest training
points to make a prediction. As shown in Figure 1-(6), the method works very well on this simple
low-dimensional data, even outperforming the local coordinate coding approach. However, if we
increase the data dimensionality to 256 by adding 253-dimensional independent Gaussian noise
with zero mean and unit variance, local coordinate coding becomes superior to local kernel
smoothing, as shown in Figure 1-(7) and (8). This is consistent with our theory, which suggests
that local coordinate coding can work well in high dimension; on the other hand, local kernel
smoothing is known to suffer from high dimensionality and noise.
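The local kernel smoothing baseline and the noise augmentation can be sketched as below. The Gaussian weighting and the `bandwidth` parameter are assumptions, since the text only specifies a weighted average over the K nearest training points. The function names are hypothetical.

```python
import numpy as np

def knn_kernel_smoothing(x_train, y_train, x_query, k=5, bandwidth=1.0):
    """Predict by a Gaussian-weighted average over the k nearest training points."""
    d2 = ((x_query[:, None, :] - x_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]             # indices of the k nearest points
    d2_k = np.take_along_axis(d2, nearest, axis=1)
    # Subtract the smallest distance per row for numerical stability; the
    # common factor cancels in the normalized average.
    w = np.exp(-(d2_k - d2_k[:, :1]) / (2.0 * bandwidth ** 2))
    return (w * y_train[nearest]).sum(axis=1) / w.sum(axis=1)

def add_noise_dimensions(x, n_extra, rng):
    """Append n_extra independent N(0, 1) coordinates to every point."""
    return np.hstack([x, rng.standard_normal((len(x), n_extra))])

rng = np.random.default_rng(0)
x3 = rng.random((10, 3))
x256 = add_noise_dimensions(x3, 253, rng)               # 3 + 253 = 256 dimensions
```

In the augmented space, nearest-neighbor distances are dominated by the noise coordinates, which is one way to see why this baseline degrades in the noisy setting.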

(1) A nonlinear function   (2) RMSE=4.394   (3) RMSE=0.499   (4) RMSE=4.661
(5) RMSE=0.201   (6) RMSE=0.109   (7) RMSE=0.669   (8) RMSE=1.170

Figure 1: Experiments of nonlinear regression on the Swiss-roll data: (1) a nonlinear function on
the Swiss-roll manifold, where the color indicates function values; (2) result of sparse coding with
fixed random anchor points; (3) result of local coordinate coding with fixed random anchor points;
(4) result of sparse coding; (5) result of local coordinate coding; (6) result of local kernel
smoothing; (7) result of local coordinate coding on noisy data; (8) result of local kernel smoothing
on noisy data.

Figure 2: Coding locality on Swiss roll: (a) sparse coding vs. (b) local coordinate coding.

6.2 Handwritten Digit Recognition

Our second experiment is based on the MNIST handwritten digit recognition benchmark, where
each data point is a 28 x 28 gray image, pre-normalized into a unit-length 784-dimensional vector.
In our setting, the set $C$ of anchor points is obtained from sparse coding, whose formulation
follows (3), with the regularization on $v$ replaced by inequality constraints $\|v\| \le 1$. Our
focus here is not on anchor point learning, but rather on checking whether a good nonlinear
classifier can be obtained if we enforce sparsity and locality in data representation, and then
apply simple one-against-all linear SVMs.

Since the optimization cost of sparse coding is invariant under flipping the sign of $v$, we take
a post-processing step to change the sign of $v$ if we find the corresponding $\gamma_v(x)$ for most
of $x$ is negative. This rectification ensures that the anchor points lie on the data manifold. One
example of $C$ is visualized in Figure 3, where the number of anchor points is $|C| = 512$. With the
obtained $C$, for each data point $x$ we solve the local coordinate coding problem (4), by optimizing
$\gamma$ only, to obtain the representation $[\gamma_v(x)]_{v \in C}$. In the experiments we try
different sizes of bases. The classification error rates are provided in Table 2. In addition we
also compare with a linear classifier on raw images, local kernel smoothing based on K-nearest
neighbors, and linear classifiers using representations obtained from various unsupervised learning
methods, including an auto-encoder based on deep belief networks, Laplacian eigenmaps [1], and VQ
coding based on K-means. We note that, like most other manifold learning approaches, Laplacian
eigenmaps is a transductive method which has to incorporate both training and testing data in
training. The comparison results are summarized in Table 3. Both sparse coding and local coordinate
coding perform quite well for this nonlinear classification task, significantly outperforming
linear classifiers on raw images. In addition, local coordinate coding is consistently better than
sparse coding across various basis sizes. We further check the locality of both representations by
plotting Figure 4, where the basis number is 512, and find that sparse coding on this data set
happens to be quite local. Unlike the case of the Swiss-roll data, here only a small portion of
nonzero coefficients (again mostly negative) are assigned onto the bases whose distances to the
encoded data exceed the average of basis-to-datum distances. This locality explains why sparse
coding works well on MNIST data. On the other hand, local coordinate coding is able to remove the
unusual coefficients and further improve the locality. Among the compared methods in Table 3, we
note that the previously reported error rate of 1.2% for deep belief networks was obtained via
unsupervised pre-training followed by supervised back-propagation. The error rate based on
unsupervised training of deep belief networks is about 1.90% (this is obtained via a personal
communication with Ruslan Salakhutdinov at University of Toronto). Therefore our result is
competitive with the state-of-the-art results that are based on unsupervised feature learning
plus linear classification without using additional image geometric information.

Figure 3: The anchor points of MNIST digits (|C| = 512).
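The sign rectification applied to the sparse coding bases in this experiment can be sketched as follows. The exact rule is not spelled out in the text, so this sketch assumes that an anchor is flipped when more than half of its nonzero coefficients are negative; `rectify_signs` is a hypothetical name.

```python
import numpy as np

def rectify_signs(anchors, codes):
    """Flip anchor v (and its code column) when gamma_v(x) is negative for most x.

    anchors: array of shape (num_anchors, dim); codes: array of shape (num_points, num_anchors).
    Assumption: "most" is measured among the nonzero coefficients of each anchor.
    """
    nonzero = np.maximum((codes != 0).sum(axis=0), 1)   # avoid dividing attention over empty columns
    mostly_negative = (codes < 0).sum(axis=0) > 0.5 * nonzero
    signs = np.where(mostly_negative, -1.0, 1.0)
    return anchors * signs[:, None], codes * signs[None, :]

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
codes = np.array([[-1.0, 2.0], [-3.0, 0.0], [1.0, 4.0]])
new_anchors, new_codes = rectify_signs(anchors, codes)
```

Flipping an anchor together with its coefficients leaves every reconstruction `codes @ anchors` unchanged, which is why the sparse coding cost is unaffected by the rectification.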



Table 2: Error rates (%) of MNIST classification with different |C|.

|C|                                        512    1024   2048   4096
Linear SVM with sparse coding              2.96   2.64   2.16   2.02
Linear SVM with local coordinate coding    2.64   2.44   2.08   1.90



Table 3: Error rates (%) of MNIST classification with different methods.

Methods                                      Error Rate
Linear SVM with raw images                   12.0
Local kernel smoothing                       3.48
Linear SVM with Laplacian eigenmap           2.73
Linear SVM with VQ coding                    3.98
Linear classifier with deep belief network   1.90
Linear SVM with sparse coding                2.02
Linear SVM with local coordinate coding      1.90

Figure 4: Coding locality on MNIST: (a) sparse coding vs. (b) local coordinate coding.

6.3 Object Recognition


Our third experiment is image classification based on coding of local patches. We use the
Caltech-101 benchmark, which contains 9144 images covering 101 classes (including animals,
vehicles, flowers, etc.) of objects and one additional background class. The visual patterns within
each class have a high degree of variation in translation, deformation, scale, and rotation, which
requires a data coding strategy different from that for MNIST images. Basically, instead of coding
on entire images, successful approaches first perform coding on local patches, and then pool local
codes to obtain image-level representations. The current state-of-the-art method [3] on Caltech-101
takes the following steps: (1) extraction of SIFT descriptors from local patches at a grid of
locations in an image; (2) VQ coding of each SIFT descriptor; (3) average pooling of codes at
different locations and scales; (4) classification using a nonlinear SVM with Chi-square kernel.
Here we will examine a different method that replaces VQ coding by sparse coding or local coordinate
coding, and applies simple linear SVMs for classification.

We follow the common experiment setup for Caltech-101, i.e., training on 30 images per category
and testing on the rest, and repeat the experiments 10 times with random splits. The step size of
the grid is 8 pixels, and each SIFT descriptor is extracted on a 24 x 24 patch centered at a grid
location; we use the SIFT descriptors of 200,000 random patches to train the bases for VQ and sparse
coding; each image is partitioned into 1 x 1, 2 x 2, and 4 x 4 blocks in 3 different scales, and
pooling is done within each of the 21 blocks. In addition to average pooling, we also try max
pooling, i.e., computing the max value of each dimension of codes in each block. Finally the pooled
codes are concatenated to form a single image-level feature vector. The results are presented in
Table 4. Our methods using sparse coding and local coordinate coding both achieve much higher
accuracies on this popular benchmark than the VQ-based baselines. Since only linear classifiers are
required, the methods are much more scalable and efficient for training and testing, compared to
those state-of-the-art methods relying on nonlinear SVMs. We note that local coordinate coding does
not produce better results than sparse coding in this experiment. This is because sparse coding in
this case is already sufficiently local, as illustrated in Figure 5. This result again is consistent
with the main point of the paper, that is, coding locality is essential (and sufficient) for
ensuring a good nonlinear learning performance.
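The pooling step described above can be sketched as follows. This is an illustrative implementation under assumptions: patch positions are given as (row, column) coordinates, empty blocks contribute zeros, and the function name `pyramid_pool` is hypothetical. It pools per-patch codes inside each of the 1 + 4 + 16 = 21 blocks and concatenates the results.

```python
import numpy as np

def pyramid_pool(codes, positions, image_size, levels=(1, 2, 4), mode="max"):
    """Pool per-patch codes inside each block of a 1x1, 2x2, 4x4 spatial pyramid.

    codes: (num_patches, k) codes; positions: (num_patches, 2) patch centers as (row, col).
    Returns a vector of length 21 * k for the default levels.
    """
    height, width = image_size
    k = codes.shape[1]
    pooled = []
    for g in levels:
        # Block index of every patch at this pyramid level.
        rows = np.minimum((positions[:, 0] * g / height).astype(int), g - 1)
        cols = np.minimum((positions[:, 1] * g / width).astype(int), g - 1)
        for r in range(g):
            for c in range(g):
                block = codes[(rows == r) & (cols == c)]
                if len(block) == 0:
                    pooled.append(np.zeros(k))        # assumption: empty blocks give zeros
                elif mode == "max":
                    pooled.append(block.max(axis=0))  # max of each code dimension
                else:
                    pooled.append(block.mean(axis=0)) # average pooling
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
codes = rng.random((100, 8))
positions = rng.random((100, 2)) * 64.0
feature_max = pyramid_pool(codes, positions, (64, 64), mode="max")
feature_avg = pyramid_pool(codes, positions, (64, 64), mode="average")
```

The resulting image-level vector is what the linear SVM consumes, so its dimension is 21 times the number of anchor points.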



Table 4: Classification rate (%) comparison on Caltech-101.

Methods                                                 Accuracy
VQ coding, average pooling, linear SVM                  58.81 ± 1.51
VQ coding, average pooling, nonlinear SVM               63.99 ± 0.88
Sparse coding, average pooling, linear SVM              66.68 ± 0.66
Sparse coding, max pooling, linear SVM                  73.20 ± 0.54
Local coordinate coding, average pooling, linear SVM    66.72 ± 0.52
Local coordinate coding, max pooling, linear SVM        73.14 ± 0.48



Figure 5: Coding locality on Caltech-101: (a) sparse coding vs. (b) local coordinate coding.

7 Conclusion

This paper introduces a new method for high dimensional nonlinear learning with data distributed
on manifolds. The method can be seen as generalized local linear function approximation, but can
be achieved by learning a global linear function with respect to coordinates from unsupervised local
coordinate coding. Compared to popular manifold learning methods, our approach can naturally
handle unseen data and has a linear complexity with respect to data size. The work also generalizes
popular VQ coding and sparse coding schemes, and reveals that locality of coding is essential for
supervised function learning. The generalization performance obtained in the paper depends on the
intrinsic dimensionality of the data manifold. The experiments on synthetic data, handwritten digit
data, and object-class images further confirm the findings of our analysis.

While some results in the paper explicitly rely on the manifold concept, the main idea is more 
general than manifold learning. The theory is valid even when the data do not lie on a manifold. 
In fact, the manifold structure is only used to bound the complexity of the local coordinate coding 
scheme, while the general theory can still be applied if we estimate the complexity using other 
means. 

Finally, it is worth mentioning that on many real data sets, sparse coding (without a locality constraint) 
automatically produces coding schemes that are nearly local. This explains the practical success 
of sparse coding. It remains an interesting question to investigate conditions under which sparse 
codes are local, since such conditions directly imply the effectiveness of sparse coding according to 
our theory. 



8 Proofs 



8.1 Proof of Proposition 2.1 



Consider a change of the $\mathbb{R}^d$ origin by $u \in \mathbb{R}^d$, which shifts any point $x \in \mathbb{R}^d$ to $x + u$, and points 
$v \in C$ to $v + u$. The shift-invariance requirement implies that after the change, we map $x + u$ to 
$\sum_{v \in C} \gamma_v(x) v + u$, which should equal $\sum_{v \in C} \gamma_v(x) (v + u)$. This is equivalent to $u = \sum_{v \in C} \gamma_v(x) u$, 
which holds if and only if $\sum_{v \in C} \gamma_v(x) = 1$. 
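
The argument can be checked numerically with a minimal sketch. The anchor points, weights, and shift below are arbitrary illustrative values, and the helper names are hypothetical. The code compares "reconstruct, then shift" against "shift the anchors, then reconstruct".

```python
def reconstruct(weights, anchors):
    """Return sum_v gamma_v * v for anchors given as coordinate tuples."""
    dim = len(anchors[0])
    return tuple(sum(w * a[i] for w, a in zip(weights, anchors)) for i in range(dim))

def shift(point, u):
    """Translate a point by the vector u."""
    return tuple(p + s for p, s in zip(point, u))

def is_shift_invariant(weights, anchors, u, tol=1e-9):
    """True if shifting the origin by u commutes with the reconstruction map."""
    lhs = shift(reconstruct(weights, anchors), u)
    rhs = reconstruct(weights, [shift(a, u) for a in anchors])
    return all(abs(l - r) <= tol for l, r in zip(lhs, rhs))

anchors = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
u = (3.0, -1.5)
print(is_shift_invariant([0.2, 0.5, 0.3], anchors, u))  # weights sum to 1
print(is_shift_invariant([0.2, 0.5, 0.6], anchors, u))  # weights sum to 1.3
```

The first call reports invariance and the second does not, because the two sides differ by (sum of weights - 1) times u.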

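8.2 Proof of Lemma 2.1 

For simplicity, let $\gamma_v = \gamma_v(x)$ and $x' = \gamma(x) = \sum_{v \in C} \gamma_v v$. We have 

\[
\begin{aligned}
\Big| f(x) - \sum_{v \in C} \gamma_v f(v) \Big|
&\le |f(x) - f(x')| + \Big| \sum_{v \in C} \gamma_v \big( f(v) - f(x') \big) \Big| \\
&= |f(x) - f(x')| + \Big| \sum_{v \in C} \gamma_v \big( f(v) - f(x') - \nabla f(x')^\top (v - x') \big) \Big| \\
&\le |f(x) - f(x')| + \sum_{v \in C} |\gamma_v| \, \big| f(v) - f(x') - \nabla f(x')^\top (v - x') \big| \\
&\le \alpha \|x - x'\| + \beta \sum_{v \in C} |\gamma_v| \, \|x' - v\|^{1+p}.
\end{aligned}
\]

The equality holds because $\sum_{v \in C} \gamma_v = 1$ and $x' = \sum_{v \in C} \gamma_v v$ give $\sum_{v \in C} \gamma_v (v - x') = 0$. 
This implies the bound. 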
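
As a sanity check of the linearization bound, the sketch below evaluates both sides on random inputs. Everything specific here is an assumption chosen for illustration: the test function f(x) = sin(a . x) with a unit vector a (which is Lipschitz smooth with alpha = 1, beta = 1/2, p = 1 under the Euclidean norm), the random anchors, and the random weights, which may be negative but sum to one.

```python
import math
import random

A_DIR = (0.6, 0.8)  # unit vector, so f below has alpha = 1, beta = 1/2, p = 1

def f(x):
    return math.sin(A_DIR[0] * x[0] + A_DIR[1] * x[1])

def dist(x, y):
    return math.hypot(x[0] - y[0], x[1] - y[1])

def lemma_sides(x, anchors, gamma, alpha=1.0, beta=0.5, p=1.0):
    """Return (lhs, rhs) of the linearization bound for one point x."""
    x_prime = (sum(g * v[0] for g, v in zip(gamma, anchors)),
               sum(g * v[1] for g, v in zip(gamma, anchors)))
    lhs = abs(f(x) - sum(g * f(v) for g, v in zip(gamma, anchors)))
    rhs = alpha * dist(x, x_prime) + beta * sum(
        abs(g) * dist(v, x_prime) ** (1 + p) for g, v in zip(gamma, anchors))
    return lhs, rhs

def random_trial(rng, n_anchors=4):
    """Draw a point, anchors, and weights that sum to one; return both sides."""
    x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
    anchors = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_anchors)]
    raw = [rng.uniform(-1, 1) for _ in range(n_anchors - 1)]
    gamma = raw + [1.0 - sum(raw)]
    return lemma_sides(x, anchors, gamma)

rng = random.Random(0)
trials = [random_trial(rng) for _ in range(1000)]
print(all(lhs <= rhs + 1e-12 for lhs, rhs in trials))
```

Note that the weights here are not required to be local or to reconstruct x well; the bound simply becomes loose when they are not, which is the quantity the localization measure tracks.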
8.3 Proof of Theorem 2.1 

Given any $\epsilon > 0$, consider an $\epsilon$-cover $C'$ of $\mathcal{M}$ with $|C'| \le \mathcal{N}(\epsilon, \mathcal{M})$. Given each $u \in C'$, define 
$C_u = \{v_1(u), \ldots, v_m(u)\}$, where $v_j(u)$ are defined in Definition 2.4. Define the anchor points as 

\[
C = \bigcup_{u \in C'} \{u + v_j(u) : j = 1, \ldots, m\} \cup C'.
\]

It follows that $|C| \le (1 + m)\, \mathcal{N}(\epsilon, \mathcal{M})$. 

In the following, we only need to prove the existence of a coding $\gamma$ on $\mathcal{M}$ that satisfies the 
requirement of the theorem. Without loss of generality, we assume that $\|v_j(u)\| = \epsilon$ for each $u$ and 
$j$, and that, given $u$, the vectors $\{v_j(u) : j = 1, \ldots, m\}$ are orthogonal with respect to $A$: $v_j(u)^\top A v_k(u) = 0$ when $j \ne k$. 

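For each $x \in \mathcal{M}$, let $u_x \in C'$ be the closest point to $x$ in $C'$. We have $\|x - u_x\| \le \epsilon$ by the 
definition of $C'$. Now, Definition 2.4 implies that there exist $\gamma'_j(x)$ ($j = 1, \ldots, m$) such that 

\[
\Big\| x - u_x - \sum_{j=1}^{m} \gamma'_j(x) v_j(u_x) \Big\| \le c_p(\mathcal{M})\, \epsilon^{1+p}.
\]

The optimal choice is the $A$-projection of $x - u_x$ to the subspace spanned by $\{v_j(u_x) : j = 1, \ldots, m\}$. 
The orthogonality condition thus implies that 

\[
\sum_{j=1}^{m} \gamma'_j(x)^2 \, \|v_j(u_x)\|^2 \le \|x - u_x\|^2 \le \epsilon^2.
\]

Therefore 

\[
\sum_{j=1}^{m} \gamma'_j(x)^2 \le 1,
\]

which implies that for all $x$: 

\[
\sum_{j=1}^{m} |\gamma'_j(x)| \le \sqrt{m}.
\]

We can now define the coordinate coding of $x \in \mathcal{M}$ as 

\[
\gamma_v(x) =
\begin{cases}
\gamma'_j(x) & v = u_x + v_j(u_x), \\
1 - \sum_{j=1}^{m} \gamma'_j(x) & v = u_x, \\
0 & \text{otherwise}.
\end{cases}
\]

This implies the following bounds: 

\[
\|x - \gamma(x)\| \le c_p(\mathcal{M})\, \epsilon^{1+p}
\]

and 

\[
\begin{aligned}
\sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}
&= \Big| 1 - \sum_{j=1}^{m} \gamma'_j(x) \Big| \, \|u_x - \gamma(x)\|^{1+p} + \sum_{j=1}^{m} |\gamma'_j(x)| \, \|u_x + v_j(u_x) - \gamma(x)\|^{1+p} \\
&\le (1 + \sqrt{m})\, \epsilon^{1+p} + \sum_{j=1}^{m} |\gamma'_j(x)| \, (\epsilon + \epsilon)^{1+p} \\
&\le \big[ 1 + \sqrt{m} + 2^{1+p} \sqrt{m} \big]\, \epsilon^{1+p},
\end{aligned}
\]

where we have used $\|v - u_x\| = \epsilon$ for $v = u_x + v_j(u_x)$, and $\|\gamma(x) - u_x\| \le \|x - u_x\| \le \epsilon$ (note that $\gamma(x) - u_x$ is the 
projection of $x - u_x$). 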
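
The construction can be traced on a toy manifold. The sketch below assumes the unit circle in the plane with the Euclidean norm (so A is the identity and m = 1); for this example one can take p = 1 with constant 1/2, a value computed for the illustration and not taken from the paper. The function names are hypothetical. Each point is coded with two anchors, the nearest cover point u_x and u_x plus a tangent vector of length eps.

```python
import math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def circle_coding(theta, eps):
    """Code x = (cos theta, sin theta) with anchors [u_x, u_x + v(u_x)]."""
    # Cover size chosen so the chord to the nearest cover point is at most eps.
    k = math.ceil(math.pi / (2.0 * math.asin(eps / 2.0)))
    delta = 2.0 * math.pi / k
    phi = round(theta / delta) * delta                 # angle of the nearest cover point
    x = (math.cos(theta), math.sin(theta))
    u = (math.cos(phi), math.sin(phi))
    v = (-eps * math.sin(phi), eps * math.cos(phi))    # tangent at u_x with norm eps
    g = ((x[0] - u[0]) * v[0] + (x[1] - u[1]) * v[1]) / eps ** 2  # projection coefficient
    anchors = [u, (u[0] + v[0], u[1] + v[1])]
    weights = [1.0 - g, g]                             # weights sum to one
    gamma_x = (u[0] + g * v[0], u[1] + g * v[1])       # physical approximation of x
    return x, gamma_x, weights, anchors

def check(eps, n_points=720):
    """Verify the bounds of the proof for m = 1, p = 1 on a grid of angles."""
    ok = True
    for i in range(n_points):
        theta = 2.0 * math.pi * i / n_points
        x, gx, w, anchors = circle_coding(theta, eps)
        localization = sum(abs(wi) * dist(a, gx) ** 2 for wi, a in zip(w, anchors))
        ok = ok and abs(sum(w) - 1.0) < 1e-12
        ok = ok and dist(x, gx) <= 0.5 * eps ** 2 + 1e-12   # approximation error
        ok = ok and localization <= 6.0 * eps ** 2 + 1e-12  # 1 + sqrt(m) + 2^(1+p) sqrt(m) = 6
        ok = ok and sum(wi * wi for wi in w) <= 5.0 + 1e-12  # 1 + (1 + sqrt(m))^2 = 5
    return ok

print(check(0.3), check(0.05))
```

Both the approximation error and the localization term shrink like eps squared here, while the number of anchors grows only with the covering number of the circle.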

8.4 Proof of Theorem 3.1 

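Consider $n + 1$ samples $S_{n+1} = \{(x_1, y_1), \ldots, (x_{n+1}, y_{n+1})\}$. We shall introduce the following notation: 

\[
[\tilde{w}_v] = \arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i=1}^{n+1} \phi\big(f_{\gamma,C}(w, x_i), y_i\big) + \lambda \sum_{v \in C} w_v^2 \Big].
\]

Let $k$ be an integer randomly drawn from $\{1, \ldots, n + 1\}$. Let $[\tilde{w}^{(k)}_v]$ be the solution of 

\[
\arg\min_{[w_v]} \Big[ \frac{1}{n} \sum_{i = 1, \ldots, n+1;\; i \ne k} \phi\big(f_{\gamma,C}(w, x_i), y_i\big) + \lambda \sum_{v \in C} w_v^2 \Big], \qquad (5)
\]

with the $k$-th example left out. 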

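We have the following stability lemma from |H], which can be stated as follows using our terminology: 

Lemma 8.1 The following inequality holds: 

\[
\big| f_{\gamma,C}(\tilde{w}^{(k)}, x_k) - f_{\gamma,C}(\tilde{w}, x_k) \big| \le \frac{\|x_k\|_\gamma^2}{2 \lambda n} \, \big| \phi'_1\big(f_{\gamma,C}(\tilde{w}, x_k), y_k\big) \big|.
\]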
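
The stability inequality can be checked in a deliberately simplified setting. The sketch below assumes a single scalar feature g_i standing in for the code vector (so the squared coding norm of x_k is g_k^2) and the squared loss (f - y)^2, for which both minimizers have a closed form. The data, the value of lambda, and the function names are illustrative assumptions, not part of the paper.

```python
import random

def ridge_scalar(g, y, n, lam, skip=None):
    """Minimize (1/n) * sum_i (w * g_i - y_i)^2 + lam * w^2 over a scalar w."""
    num = sum(gi * yi for i, (gi, yi) in enumerate(zip(g, y)) if i != skip) / n
    den = sum(gi * gi for i, gi in enumerate(g) if i != skip) / n + lam
    return num / den

def stability_gap(g, y, lam, k):
    """Return (lhs, rhs) of the stability inequality for the left-out index k."""
    n = len(g) - 1                        # the data hold n + 1 samples
    w_full = ridge_scalar(g, y, n, lam)   # trained on all n + 1 samples
    w_loo = ridge_scalar(g, y, n, lam, skip=k)  # k-th sample left out
    lhs = abs(w_loo * g[k] - w_full * g[k])
    dloss = 2.0 * (w_full * g[k] - y[k])  # derivative of (f - y)^2 in f at the full solution
    rhs = g[k] ** 2 / (2.0 * lam * n) * abs(dloss)
    return lhs, rhs

rng = random.Random(1)
g = [rng.uniform(-1.0, 1.0) for _ in range(21)]
y = [rng.uniform(-1.0, 1.0) for _ in range(21)]
gaps = [stability_gap(g, y, lam=0.1, k=k) for k in range(len(g))]
print(all(lhs <= rhs * (1 + 1e-9) + 1e-15 for lhs, rhs in gaps))
```

In this scalar case the inequality can also be verified by hand: the change in prediction equals g_k^2 |w g_k - y_k| / n divided by a denominator that is at least lambda.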



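By using Lemma 8.1, we obtain for all $\alpha > 0$: 

\[
\begin{aligned}
& \phi\big(f_{\gamma,C}(\tilde{w}, x_k), y_k\big) - \phi\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \\
={}& \phi\big(f_{\gamma,C}(\tilde{w}, x_k), y_k\big) - \phi\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \\
& - \phi'_1\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \big( f_{\gamma,C}(\tilde{w}, x_k) - f_{\gamma,C}(\tilde{w}^{(k)}, x_k) \big) \\
& + \phi'_1\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \big( f_{\gamma,C}(\tilde{w}, x_k) - f_{\gamma,C}(\tilde{w}^{(k)}, x_k) \big) \\
\ge{}& \phi'_1\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \big( f_{\gamma,C}(\tilde{w}, x_k) - f_{\gamma,C}(\tilde{w}^{(k)}, x_k) \big) \\
\ge{}& - \big| \phi'_1\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big) \big| \, \big| \phi'_1\big(f_{\gamma,C}(\tilde{w}, x_k), y_k\big) \big| \, \|x_k\|_\gamma^2 / (2 \lambda n) \\
\ge{}& - B^2 \|x_k\|_\gamma^2 / (2 \lambda n).
\end{aligned}
\]

In the above derivation, the first inequality uses the convexity of $\phi(f, y)$ with respect to $f$, which 
implies that $\phi(f_1, y) - \phi(f_2, y) - \phi'_1(f_2, y)(f_1 - f_2) \ge 0$. The second inequality uses Lemma 8.1, and 
the third inequality uses the assumption on the loss function, $|\phi'_1(f, y)| \le B$. 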

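Now by summing over $k$, and considering any fixed $f \in \mathcal{F}_{\alpha,\beta,p}$, we obtain: 

\[
\begin{aligned}
\sum_{k=1}^{n+1} \phi\big(f_{\gamma,C}(\tilde{w}^{(k)}, x_k), y_k\big)
&\le \sum_{k=1}^{n+1} \phi\big(f_{\gamma,C}(\tilde{w}, x_k), y_k\big) + \frac{B^2}{2 \lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2 \\
&\le n \Big[ \frac{1}{n} \sum_{k=1}^{n+1} \phi\Big( \sum_{v \in C} \gamma_v(x_k) f(v), y_k \Big) + \lambda \sum_{v \in C} f(v)^2 \Big] + \frac{B^2}{2 \lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2 \\
&\le n \Big[ \frac{1}{n} \sum_{k=1}^{n+1} \big[ \phi(f(x_k), y_k) + B\, Q(x_k) \big] + \lambda \sum_{v \in C} f(v)^2 \Big] + \frac{B^2}{2 \lambda n} \sum_{k=1}^{n+1} \|x_k\|_\gamma^2,
\end{aligned}
\]

where $Q(x) = \alpha \|x - \gamma(x)\| + \beta \sum_{v \in C} |\gamma_v(x)| \, \|v - \gamma(x)\|^{1+p}$. In the above derivation, the second 
inequality follows from the definition of $\tilde{w}$ as the minimizer of its objective over all $n + 1$ samples. The third inequality follows from 
Lemma 2.1. Now by taking expectation with respect to $S_{n+1}$, we obtain 

\[
\begin{aligned}
& (n + 1)\, \mathbf{E}_{S_{n+1}} \phi\big(f_{\gamma,C}(\tilde{w}^{(n+1)}, x_{n+1}), y_{n+1}\big) \\
\le{}& n \Big[ \frac{n+1}{n} \mathbf{E}_{x,y} \phi(f(x), y) + \frac{n+1}{n} B\, Q_{\alpha,\beta,p}(\gamma, C) + \lambda \sum_{v \in C} f(v)^2 \Big] + \frac{B^2 (n+1)}{2 \lambda n} \mathbf{E}_x \|x\|_\gamma^2.
\end{aligned}
\]

Dividing both sides by $n + 1$, and noting that $\tilde{w}^{(n+1)}$ is obtained from the $n$ samples other than $(x_{n+1}, y_{n+1})$, implies the desired bound. 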

8.5 Proof of Theorem 3.2 

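Note that any measurable function $f : \mathcal{M} \to \mathbb{R}$ can be approximated by $\mathcal{F}_{\alpha,\beta,p}$ with $\alpha, \beta \to \infty$ and 
$p = 0$. Therefore we only need to show 

\[
\lim_{n \to \infty} \mathbf{E}_{S_n} \mathbf{E}_{x,y} \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) = \lim_{n \to \infty} \inf_{f \in \mathcal{F}_{\alpha,\beta,p}} \mathbf{E}_{x,y} \phi(f(x), y).
\]

Theorem 2.1 implies that it is possible to pick $(\gamma, C)$ such that $|C|/n \to 0$ and $Q_{\alpha,\beta,p}(\gamma, C) \to 0$. 
Moreover, $\|x\|_\gamma$ is bounded. 

Given any $f \in \mathcal{F}_{\alpha,\beta,0}$ and any constant $A > 0$ that is independent of $n$, if we let $f_A(x) = 
\max(\min(f(x), A), -A)$, then it is clear that $f_A(x) \in \mathcal{F}_{\alpha,\alpha+\beta,0}$. Therefore Theorem 3.1 implies that 
as $n \to \infty$, 

\[
\mathbf{E}_{S_n} \mathbf{E}_{x,y} \phi\big(f_{\gamma,C}(\hat{w}, x), y\big) \le \mathbf{E}_{x,y} \phi(f_A(x), y) + o(1).
\]

Since $A$ is arbitrary, we let $A \to \infty$ to obtain the desired result. 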